DataCite metadata were retrieved using the rdatacite
package in November 2022. Each of the six institutions was searched
by name in the
creators.affiliation.name metadata field. Results were
filtered to DOIs with a publicationYear of 2012 or
later and a resourceTypeGeneral of dataset or software. Because
the search terms also returned other institutions with similar names, results
were further filtered to include only DOIs from the relevant institutional
affiliations.
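As a hedged illustration, the per-institution query could be built as follows; the query-string helper, institution name, and limit are simplified stand-ins for the actual retrieval script, which also handled pagination and disambiguation:

```r
# Sketch of the per-institution DataCite query string; the live call via
# rdatacite::dc_dois() is commented out because it requires network access.
build_query <- function(institution) {
  sprintf('creators.affiliation.name:"%s"', institution)
}

build_query("Virginia Tech")
#> [1] "creators.affiliation.name:\"Virginia Tech\""

# res <- rdatacite::dc_dois(query = build_query("Virginia Tech"), limit = 100)
```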
Following the recommendations of the Crossref API documentation, Crossref metadata were pulled
from the April 2022 Public Data File (http://dx.doi.org/10.13003/83b2gq). DOIs were retained if they had
a created date-parts year of 2012 or later,
a type of dataset (Crossref does not have
software as an available type), and an author affiliation matching one
of the six institutions.
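As an illustration of that filtering logic (field names follow the Crossref REST API schema; the record below is a toy stand-in for one entry in the public data file):

```r
# Keep records created in 2012 or later whose type is "dataset"
is_candidate <- function(rec) {
  year <- rec$created$`date-parts`[[1]][[1]]
  !is.null(year) && year >= 2012 && identical(rec$type, "dataset")
}

# Toy record shaped like a Crossref works item
rec <- list(type = "dataset",
            created = list(`date-parts` = list(list(2015, 4, 1))))
is_candidate(rec)
#> [1] TRUE
```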
Upon initial examination of the affiliation data, we realized that our own institutional repositories were not represented in the data because the affiliation metadata field was not completed as part of the DOI generation process.
To capture data shared in our institutional repositories as a
comparison, a second search was performed to retrieve DOIs published by
the institutional repositories at each university. For the institutional
repositories using DataCite to issue DOIs (5 of the 6 institutions
at the time), the DataCite API was queried by the names of the institutional
repositories in the publisher metadata field. For the one
institution using Crossref to issue DOIs (Duke), the Crossref API was
used to retrieve all DOIs published under the Duke member prefixes.
Institutional repository data were then filtered to include only the relevant repositories, the dataset and software resource types, and DOIs published in 2012 or later.
Affiliation data from DataCite, affiliation data from Crossref, and the institutional repository data were combined into a single dataset.
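A minimal sketch of that combining step (the input object names are assumptions; the actual script, 3_Combined_data.R, also harmonized column names across sources before binding):

```r
library(dplyr)

# Toy stand-ins for the three retrieval results
datacite_affiliation_dois <- data.frame(DOI = "10.0001/aaa", institution = "Cornell")
crossref_affiliation_dois <- data.frame(DOI = "10.0002/bbb", institution = "Duke")
ir_publisher_dois         <- data.frame(DOI = "10.0003/ccc", institution = "Minnesota")

# bind_rows() with .id records the source of each row in a `group` column
combined_dois <- bind_rows(
  "Affiliation - Datacite" = datacite_affiliation_dois,
  "Affiliation - CrossRef" = crossref_affiliation_dois,
  "IR_publisher"           = ir_publisher_dois,
  .id = "group"
)
```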
Load required packages and read in combined data.
#packages
pacman::p_load(dplyr,
tidyr,
ggplot2,
rjson,
rdatacite,
cowplot,
stringr,
knitr,
DT,
ggbreak)
#Load the combined data from 3_Combined_data.R
load(file="data_rdata_files/Combined_ALL_data.Rdata")
#rename object
all_dois <- combined_dois
#re-factor group so that datacite appears before cross ref
all_dois$group <- factor(all_dois$group, levels = c("Affiliation - Datacite", "Affiliation - CrossRef", "IR_publisher"))
Some repositories (such as Harvard’s Dataverse and the Qualitative Data Repository) assign DOIs at the level of the file rather than the study. Similarly, Zenodo often has many related DOIs for multiple figures within a study. To approximate study-level counts of data sharing, we look at DOIs collapsed by “container”.
by_container <-
all_dois %>%
filter(!is.na(container_identifier)) %>%
group_by(container_identifier, publisher, title, institution) %>%
summarize(count=n()) %>%
arrange(desc(count))
How many publishers have container DOIs?
by_container %>%
group_by(publisher) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
datatable
Collapsing by container for counts
containerdups <- which(!is.na(all_dois$container_identifier) & duplicated(all_dois$container_identifier))
all_dois_collapsed <- all_dois[-containerdups,]
This leaves a total of 165950 cases.
DOI types by resource
all_dois_collapsed %>%
group_by(resourceTypeGeneral, group) %>%
summarize(count=n()) %>%
pivot_wider(names_from = group,
values_from = count,
values_fill = 0) %>%
kable()
| resourceTypeGeneral | Affiliation - Datacite | Affiliation - CrossRef | IR_publisher |
|---|---|---|---|
| Dataset | 11572 | 147702 | 2103 |
| Software | 4512 | 0 | 61 |
DOI by institutional affiliation/publisher
all_dois_collapsed %>%
group_by(group, institution) %>%
summarize(count=n()) %>%
pivot_wider(names_from = group,
values_from = count) %>%
kable()
| institution | Affiliation - Datacite | Affiliation - CrossRef | IR_publisher |
|---|---|---|---|
| Cornell | 3921 | 706 | 174 |
| Duke | 2372 | 3603 | 225 |
| Michigan | 4188 | 141111 | 645 |
| Minnesota | 2408 | 1700 | 692 |
| Virginia Tech | 1553 | 64 | 333 |
| Washington U | 1642 | 518 | 95 |
Look at all the Institutional Repositories Captured
IR_pubs <- all_dois_collapsed %>%
filter(group == "IR_publisher") %>%
group_by(publisher_plus) %>%
summarize(count = n())
IR_pubs %>%
kable(col.names = c("Institutional Repository", "Count"))
| Institutional Repository | Count |
|---|---|
| Cornell | 174 |
| Duke-Duke Digital Repository | 78 |
| Duke-Research Data Repository, Duke University | 147 |
| Michigan | 10 |
| Michigan-Deep Blue | 515 |
| Michigan-ICPSR/ISR | 109 |
| Michigan-Other | 11 |
| Minnesota | 692 |
| Virginia Tech | 333 |
| Washington U | 95 |
Replace all of these publishers with “Institutional Repository” so that they will be represented in a single bar.
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher_plus %in% unique(IR_pubs$publisher_plus))] <- "Institutional Repository"
#catch the rest of the "Cornell University Library"
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "Cornell University Library")] <- "Institutional Repository"
#and stray VT
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "University Libraries, Virginia Tech")] <- "Institutional Repository"
#and DRUM
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "Data Repository for the University of Minnesota (DRUM)")] <- "Institutional Repository"
##ICPSR is also inconsistent
all_dois_collapsed$publisher[grep("Consortium for Political", all_dois_collapsed$publisher)] <- "ICPSR"
We keep these together for the main analysis.
by_publisher_collapse <- all_dois_collapsed %>%
group_by(publisher, institution) %>%
summarize(count=n()) %>%
arrange(institution, desc(count))
Table of publisher counts
by_publisher_collapse_table <- by_publisher_collapse %>%
pivot_wider(names_from = institution,
values_from = count,
values_fill = 0) %>%
rowwise %>%
mutate(Total = sum(c_across(Cornell:`Washington U`))) %>%
ungroup() %>%
arrange(desc(Total)) %>%
mutate(Cumulative_Percent = round(cumsum(Total)/sum(Total)*100, 1))
by_publisher_collapse_table %>%
datatable
Write out the table of data & software publishers
write.csv(by_publisher_collapse_table, file="data_summary_data/Counts of Publishers By Institution - Collapsed by container.csv", row.names = F)
# by_publisher_dc_collapse <- all_dois_collapsed %>%
# group_by(publisher, institution) %>%
# summarize(count=n()) %>%
# arrange(institution, desc(count))
#table of publishers - data
by_publisher_dc_collapse_table <- by_publisher_collapse %>%
pivot_wider(names_from = institution,
values_from = count,
values_fill = 0) %>%
rowwise %>%
mutate(Total = sum(c_across(Cornell:`Washington U`))) %>%
arrange(desc(Total))
Look at publishers based on rank of number of DOIs
by_publisher_dc_collapse_table %>%
group_by(publisher) %>%
summarize(count=sum(Total)) %>%
arrange(desc(count)) %>%
mutate(pubrank = row_number()) %>%
ggplot(aes(x=pubrank, y=count)) +
geom_bar(stat="identity") +
labs(x = "Publisher Rank (Top 20)", y="Number of DOIs")+
scale_y_break(breaks =c(10000, 100000),scales = .15) +
scale_x_continuous(limits = c(0,20), sec.axis = dup_axis(labels=NULL, breaks=NULL)) +
theme_bw()
Look at the top 10 publishers - how many does this capture?
top10pubs <- by_publisher_dc_collapse_table$publisher[1:10]
by_publisher_dc_collapse_table %>%
group_by(publisher) %>%
summarize(count=sum(Total)) %>%
mutate(intop10pub = publisher %in% top10pubs) %>%
group_by(intop10pub) %>%
summarize(totalDOIs = sum(count), nrepos = n()) %>%
ungroup() %>%
mutate(propDOIs = totalDOIs/sum(totalDOIs)) %>%
kable(digits = 2)
| intop10pub | totalDOIs | nrepos | propDOIs |
|---|---|---|---|
| FALSE | 1428 | 166 | 0.01 |
| TRUE | 164522 | 10 | 0.99 |
top10colors <- c("Harvard Dataverse" = "dodgerblue2",
"Zenodo" = "darkorange1",
"ICPSR" = "darkcyan",
"Dryad" = "lightgray",
"figshare" = "purple",
"Institutional Repository" = "lightblue",
"ENCODE Data Coordination Center" = "gold2",
"Faculty Opinions Ltd" = "darkgreen",
"Taylor & Francis" = "red",
"Neotoma Paleoecological Database" = "pink")
(by_publisher_plot_collapse <- by_publisher_collapse %>%
filter(publisher %in% top10pubs) %>%
ggplot(aes(x=institution, y=count, fill=publisher)) +
geom_bar(stat="identity", position=position_dodge(preserve = "single")) +
scale_fill_manual(values = top10colors, name="Publisher")+
#scale_y_continuous(breaks = seq(from = 0, to=5000, by=500)) +
scale_y_break(breaks =c(3000, 120000),scales = .15) +
coord_cartesian(ylim = c(0,5000)) +
labs(x = "Institution", y="Count of Collapsed DOIs") +
theme_bw() +
guides(fill = guide_legend(nrow = 3, title.position = "top")) +
theme(legend.position = "bottom", legend.title.align = .5))
ggsave(by_publisher_plot_collapse, filename = "figures/Counts of DOIs by Institution_DOIcollapsed.png", device = "png", width = 8, height = 6, units="in")
by_publisher_percent_plot1 <- by_publisher_collapse %>%
group_by(institution) %>%
mutate(Percent = count/sum(count)*100) %>%
filter(publisher %in% top10pubs) %>%
ggplot(aes(x=institution, y=Percent)) +
geom_col(aes(fill=publisher)) +
scale_fill_manual(values = top10colors, name="Publisher") +
labs(x = "Institution", y="Percent of Total Data DOIs") +
guides(fill = guide_legend(title.position = "top")) +
theme_bw() +
theme(legend.position = "bottom",
legend.title.align = .5)
publegend <- get_legend(by_publisher_percent_plot1)
by_publisher_percent_plot1 <- by_publisher_percent_plot1 + theme(legend.position = "none")
by_publisher_percent_plot2 <- by_publisher_collapse %>%
filter(publisher != "ENCODE Data Coordination Center") %>%
group_by(institution) %>%
mutate(Percent = count/sum(count)*100) %>%
filter(publisher %in% top10pubs) %>%
ggplot(aes(x=institution, y=Percent)) +
geom_col(aes(fill=publisher)) +
scale_fill_manual(values = top10colors, name="Publisher") +
labs(x = "Institution", y="Percent of Total Data DOIs") +
theme_bw() +
theme(legend.position = "none", legend.title.align = .5)
# ggsave(plot = by_publisher_percent_plot1, filename="Percent DOIs Top Publisher Percents - With ENCODE.png", device = "png")
#
# ggsave(plot = by_publisher_percent_plot2, filename="Percent DOIs Top Publisher Percents - No ENCODE.png", device = "png")
(combined_pub_plots <- plot_grid(plot_grid(by_publisher_percent_plot1,
by_publisher_percent_plot2,
labels = c("A", "B")),
publegend,
nrow=2,
rel_heights = c(2,.5),
align = "v",
axis = "t"))
ggsave(plot = combined_pub_plots, filename="figures/Percent DOIs Top Publisher Percents.png", device = "png", width = 10.5, units = "in")
Overall Proportion of Data/Software DOIs in Top 10 publishers by institution
by_publisher_collapse %>%
group_by(institution) %>%
mutate(Percent = count/sum(count)*100) %>%
filter(publisher %in% top10pubs) %>%
group_by(institution) %>%
summarize(TotalCount = sum(count), TotalPercent = sum(Percent)) %>%
kable(digits =2)
| institution | TotalCount | TotalPercent |
|---|---|---|
| Cornell | 4539 | 94.54 |
| Duke | 5936 | 95.74 |
| Michigan | 145629 | 99.78 |
| Minnesota | 4536 | 94.50 |
| Virginia Tech | 1747 | 89.59 |
| Washington U | 2135 | 94.68 |
With how many different publishers are researchers sharing their data, and how does this change over time?
by_year_nrepos <- all_dois_collapsed %>%
group_by(publicationYear, publisher, institution) %>%
summarize(nDOIs = n()) %>%
group_by(publicationYear, institution) %>%
summarize(npublishers = n(), totalDOIs = sum(nDOIs))
by_year_nrepos %>%
ggplot(aes(x=publicationYear, y=npublishers, group=institution)) +
geom_line(aes(color=institution)) +
labs(x="Year",
y="Number of Repositories",
title="Number of Repositories Where Data and Software are Shared Across Time") +
theme_bw() +
theme(legend.title = element_blank())
We can also look at the data collapsed by version of a record. This is motivated by the fact that some repositories create separate entries for different versions of the same dataset or collection, and some entries have many versions.
Explore versions
Some repositories append “.vX” to the DOI.
all_dois_collapsed <- all_dois_collapsed %>%
mutate(hasversion = grepl("\\.v[[:digit:]]+$", DOI))
all_dois_collapsed %>%
filter(hasversion == TRUE) %>%
group_by(publisher, hasversion) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
datatable()
Some repositories use the versionCount field.
all_dois_collapsed %>%
filter(versionCount > 0) %>%
group_by(publisher) %>%
summarize(count=n(), AvgNversions = round(mean(versionCount),2)) %>%
arrange(desc(count)) %>%
datatable()
Some use “metadataVersion”
all_dois_collapsed %>%
filter(metadataVersion > 0) %>%
group_by(publisher) %>%
summarize(count=n(), AvgNversions = round(mean(metadataVersion),2)) %>%
arrange(desc(count)) %>%
datatable()
How to collapse by version? Maybe that’s for another day…
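For repositories that append “.vN” to the DOI, one possible approach (a sketch, not applied in this analysis) would be to strip the suffix and deduplicate on the base DOI; the example DOIs below are made up:

```r
# Strip a trailing ".vN" version suffix to recover the base DOI
base_doi <- function(doi) sub("\\.v[[:digit:]]+$", "", doi)

dois <- c("10.0000/example.123.v1", "10.0000/example.123.v2", "10.0000/other.456")
unique(base_doi(dois))
#> [1] "10.0000/example.123" "10.0000/other.456"
```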
Look at repositories with affiliation and publication years prior to 2014
DataCite released affiliation as a metadata option on October 16, 2014. Repositories with affiliations on records published before then may have updated that metadata retroactively.
What repositories have publications with affiliation before then?
all_dois_collapsed %>%
group_by(publisher, publicationYear) %>%
summarize(count=n()) %>%
arrange(publicationYear) %>%
pivot_wider(names_from = publicationYear,
values_from = count) %>%
arrange(`2012`, `2013`, `2014`, `2015`) %>%
datatable()
Look at metadata fields from the DataCite DOIs (as those records contain the most complete metadata).
First, unlist some of the nested list-column fields.
all_dois_collapsed$has_subjects <- unlist(lapply(all_dois_collapsed$subjects, function(x) length(x[[1]])))
all_dois_collapsed$has_dates <- unlist(lapply(all_dois_collapsed$dates, function(x) ifelse(length(x) > 0, nrow(x[[1]]), 0)))
all_dois_collapsed$has_relatedIdentifiers <- unlist(lapply(all_dois_collapsed$relatedIdentifiers, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))
all_dois_collapsed$has_sizes <- unlist(lapply(all_dois_collapsed$sizes, function(x) length(x[[1]])))
all_dois_collapsed$has_rightsList <- unlist(lapply(all_dois_collapsed$rightsList, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))
all_dois_collapsed$has_descriptions <- unlist(lapply(all_dois_collapsed$descriptions, function(x) ifelse("description" %in% names(x[[1]]), 1, 0)))
all_dois_collapsed$has_geolocations <- unlist(lapply(all_dois_collapsed$geoLocations, function(x) length(x[[1]])))
all_dois_collapsed$has_fundingReferences <- unlist(lapply(all_dois_collapsed$fundingReferences, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))
all_dois_collapsed$has_formats <- unlist(lapply(all_dois_collapsed$formats, function(x) length(x[[1]])))
Then create a dataset with indicators for whether fields contain information (this captures only the presence of information, not its quality).
all_dois_collapsed_completeness <-
all_dois_collapsed %>%
mutate(has_id = ifelse(!is.na(id), 1, 0),
has_publicationYear = ifelse(!is.na(publicationYear), 1, 0),
has_URL = ifelse(!is.na(URL), 1, 0)) %>%
select(id, group, institution, publisher, starts_with("has_")) %>%
pivot_longer(cols=has_subjects:has_URL,
names_to = "variable",
values_to = "value") %>%
mutate(value_indc = ifelse(value == 0, 0, 1))
by_publisher_complete_dc <- all_dois_collapsed_completeness %>%
filter(publisher %in% top10pubs) %>%
filter(group == "Affiliation - Datacite",
publisher != "Institutional Repository") %>%
group_by(publisher, variable) %>%
summarize(complete = sum(value_indc), total = n()) %>%
mutate(percent_complete = complete/total*100)
by_publisher_complete_ir <- all_dois_collapsed_completeness %>%
filter(publisher == "Institutional Repository") %>%
group_by(institution, variable) %>%
summarize(complete = sum(value_indc), total = n()) %>%
mutate(percent_complete = complete/total*100)
NOTE: This will not be accurate for Duke because Duke metadata came from Crossref.
Look at the proportion of DOIs that have funder references filled out. Because not all data are the result of funding, we also look at similar fields as a baseline completeness metric relevant to all DOIs: subjects (a non-required field that applies equally to funded and non-funded works) and publication year (a required field that serves as a proxy for the total number of records).
by_publisher_complete_ir %>%
rename(publisher = institution) %>%
bind_rows(by_publisher_complete_dc) %>%
filter(variable %in% c("has_publicationYear", "has_fundingReferences", "has_subjects")) %>%
ggplot(aes(x=publisher, y=complete)) +
geom_bar(stat = "identity", aes(fill=variable), position="dodge") +
scale_fill_hue(name="Metadata Field")+
coord_flip() +
labs(x="Repository/Publisher", y="Number of DOIs with completed field") +
theme_bw()
Write out CSV files for each institution:
for (i in unique(all_dois$institution)) {
all_dois %>%
filter(institution == i) %>%
write.csv(file=paste0("data_all_dois/All_dois_", i, gsub("-", "", Sys.Date()), ".csv"), row.names = F)
all_dois_collapsed %>%
filter(institution == i) %>%
write.csv(file=paste0("data_all_dois/All_dois_collapsed_", i, gsub("-", "", Sys.Date()), ".csv"), row.names = F)
}